Scaling out data preprocessing with Hive

Authors

Gábor MAKRAI

Zoltán PREKOPCSÁK

Abstract

We introduce a user-friendly graphical data preprocessing application based on Hive, a well-known open-source distributed data warehouse system. The application is convenient and easy to use for preprocessing tasks. To demonstrate its usability, we created a measurement framework that ensures precise results. These results show that the application has outstanding scaling capability as both the amount of data and the number of computers increase. We conclude that it is a good solution for medium- and large-scale data preprocessing.


Similar resources

Data Quality for Web Log Data Using a Hadoop Environment

Solving data quality problems is important for data warehouse construction and operation. This paper is based on developing a web log warehouse. It proposes a data quality problem methodology for data preprocessing within the log warehouse. It provides a hierarchical data warehouse architecture that is suitable for resource saving and ad hoc requirements. The data preprocessing is completed usi...


Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...


ARPN Journal of Science and Technology::Analysis of Movie Lens Data Set using Hive

Large scale data set provides the better opportunity to find out much better data relationship in the area of business intelligence. In the paper, we implement our systems using Hadoop that has been popular to store and compute Big Data. However, it is not easy to write Hadoop Map Reduce code. Therefore, we use Hive and Hive QL codes to understand the relationships between ratings and the users...


The HIVE Tool for Informed Swarm State Space Exploration

Swarm verification and parallel randomised depth-first search are very effective parallel techniques to hunt bugs in large state spaces. In case bugs are absent, however, scalability of the parallelisation is completely lost. In recent work, we proposed a mechanism to inform the workers which parts of the state space to explore. This mechanism is compatible with any action-based formalism, wher...


Execution Primitives for Scalable Joins and Aggregations in Map Reduce

Analytics on Big Data is critical to derive business insights and drive innovation in today’s Internet companies. Such analytics involve complex computations on large datasets, and are typically performed on MapReduce based frameworks such as Hive and Pig. However, in our experience, these systems are still quite limited in performing at scale. In particular, calculations that involve complex j...




Publication date: 2011
